34 research outputs found
Polyglot: Distributed Word Representations for Multilingual NLP
Distributed word representations (word embeddings) have recently contributed
to competitive performance in language modeling and several NLP tasks. In this
work, we train word embeddings for more than 100 languages using their
corresponding Wikipedias. We quantitatively demonstrate the utility of our word
embeddings by using them as the sole features for training a part of speech
tagger for a subset of these languages. We find their performance to be
competitive with near state-of-art methods in English, Danish and Swedish.
Moreover, we investigate the semantic features captured by these embeddings
through the proximity of word groupings. We will release these embeddings
publicly to help researchers in the development and enhancement of multilingual
applications.Comment: 10 pages, 2 figures, Proceedings of Conference on Computational
Natural Language Learning CoNLL'201
The Expressive Power of Word Embeddings
We seek to better understand the difference in quality of the several
publicly released embeddings. We propose several tasks that help to distinguish
the characteristics of different embeddings. Our evaluation of sentiment
polarity and synonym/antonym relations shows that embeddings are able to
capture surprisingly nuanced semantics even in the absence of sentence
structure. Moreover, benchmarking the embeddings shows great variance in
quality and characteristics of the semantics captured by the tested embeddings.
Finally, we show the impact of varying the number of dimensions and the
resolution of each dimension on the effective useful features captured by the
embedding space. Our contributions highlight the importance of embeddings for
NLP tasks and the effect of their quality on the final results.Comment: submitted to ICML 2013, Deep Learning for Audio, Speech and Language
Processing Workshop. 8 pages, 8 figure